Skip to content

admin: Support creating incidents and adding serials#8740

Open
beautifulentropy wants to merge 1 commit intomainfrom
admin-tool-sa-incidents
Open

admin: Support creating incidents and adding serials#8740
beautifulentropy wants to merge 1 commit intomainfrom
admin-tool-sa-incidents

Conversation

@beautifulentropy
Copy link
Copy Markdown
Member

@beautifulentropy beautifulentropy commented Apr 29, 2026

Add a StorageAuthorityAdmin gRPC service and four admin subcommands that move incident management into the admin tool. Register the service only for the admin client and write through a new incidents_sa_admin MySQL user; the existing incidents_sa user stays SELECT-only.

Incidents can impact tens to hundreds of millions of certificates, and operators may load serials from multiple overlapping time spans during the affected window. admin load-incident-serials reads serials from a file, fans them out across N workers (default 10), and streams them to the SA in batches of 10,000. The server re-batches at the same threshold and uses INSERT IGNORE on insertion, so overlapping bulk loads and within-batch duplicates are idempotent.

The remaining subcommands cover CRUD on the incidents table:

  • admin create-incident adds a row and creates the per-incident serials table.
  • admin list-incidents prints each incident's name, enabled state, renewBy, and URL.
  • admin update-incident changes any subset of url, renewBy, and enabled on an incident by name.

Fixes #6943

@beautifulentropy beautifulentropy force-pushed the admin-tool-sa-incidents branch 7 times, most recently from da7c02c to 5392654 Compare April 29, 2026 20:59
@beautifulentropy beautifulentropy force-pushed the admin-tool-sa-incidents branch from 5392654 to 24c16e5 Compare April 29, 2026 21:01
@beautifulentropy beautifulentropy marked this pull request as ready for review April 29, 2026 21:10
@beautifulentropy beautifulentropy requested a review from a team as a code owner April 29, 2026 21:10
@github-actions
Copy link
Copy Markdown
Contributor

@beautifulentropy, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

Copy link
Copy Markdown
Contributor

@aarongable aarongable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done a detailed review of admin/incident_test.go or saa_test.go, but I've done a pass on everything else.

Comment thread cmd/admin/dryrun.go
var _ saAdminClient = (*dryRunSAAdmin)(nil)

func (d dryRunSAAdmin) CreateIncident(_ context.Context, req *sapb.CreateIncidentRequest, _ ...grpc.CallOption) (*sapb.Incident, error) {
d.log.Infof("dry-run: Create incident %q (url=%q, renewBy=%s)", req.SerialTable, req.Url, req.RenewBy.AsTime())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a heads up that the blog PR totally overhauls how the dry-run clients above log, so keep an eye on that depending on whether this lands before or after the blog PR.

Comment thread sa/proto/sa.proto
// incidents row identified by serialTable. An empty url, nil renewBy, and
// nil enabled all mean "leave that field alone"; at least one of them must
// be set.
message UpdateIncidentRequest {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted to be really fancy, you could use a FieldMask to control which fields get updated, rather than relying on zero-values.

But as cool as FieldMasks are, we don't use them anywhere else, and this probably isn't the time to start. If we want to use them, we should make a concerted effort to use them everywhere. Anyway, just sharing for fun.

Comment thread cmd/admin/incident.go
return errors.New("-parallelism must be > 0")
}

file, err := os.Open(s.serialsFile)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea: if s.serialsFile is -, then open stdin. Alternatively, add documentation noting that a user could pass /dev/stdin as the file to read from if they want to stream serials on stdin.

Comment thread sa/proto/sa.proto
Comment on lines +379 to +382
message AddSerialsToIncidentRequest {
string serialTable = 1;
repeated string serial = 2;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've got two plurality mismatches here:

  • The message and RPC names both imply a singular incident table, but every message in the stream carries its own serialTable value, which could theoretically change from one message to the next.
  • The repeated field within the message has a singular name, which is non-idiomatic (we pluralize names of repeated fields).

I think there are several options of varying complexity for what to do:

  1. Change the field name to serials and call it a day.
  2. Make the serial field non-repeated, and don't batch lots of serials into a single message.
  3. Split this stream into two different kinds of messages, a "metadata" message and a "serials" message, just like ca.GenerateCRL does.

I don't have strong opinions on which approach you take here. 1 is the smallest diff but only addresses one of the two issues; 2 results in the simplest code by ditching batching but has reduced efficiency and only addresses one of the two issues; 3 is the most comprehensive but is admittedly complex.

Comment thread sa/proto/sa.proto
// StorageAuthorityAdmin exposes those SA methods exclusive to the admin tool.
service StorageAuthorityAdmin {
rpc CreateIncident(CreateIncidentRequest) returns (Incident) {}
rpc UpdateIncident(UpdateIncidentRequest) returns (google.protobuf.Empty) {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not return Incident, reflecting the new state of the world, just like Create?

Comment thread sa/db/02-users_next.sql

-- Storage Authority
GRANT SELECT ON * TO 'incidents_sa'@'%';
GRANT SELECT,CREATE,INSERT ON * TO 'incidents_sa_admin'@'%';
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super-nit: CREATE is the somewhat surprising/unusual permission here, so I'd list it first. (And same in the other file.)

Comment thread cmd/boulder-sa/main.go
cmd.FailOnError(err, "While initializing dbIncidentsMap")
}

dbIncidentsWriteMap := dbIncidentsMap
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure this quite makes sense. On line 69, it makes sense to initialize dbReadOnlyMap to dbMap, because the latter can be assumed to have a superset of the permissions that the former needs, so falling back will still allow everything to work. Here, it shouldn't be assumed that dbIncidentsMap can do everything that dbIncidentsWriteMap needs to do.

Comment thread cmd/admin/incident.go
if err != nil {
return fmt.Errorf("opening stream: %w", err)
}
var buf []string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment: calling this buf feels weird, because I expect that to be a byte buffer, not a list of strings. Maybe pool or batch or something like that.

Larger comment: how much performance win are we getting from batching both here and at the SA level? Batching here (and unbatching at the SA, before immediately rebatching with a batch size that happens to be the same but can't be assumed to be the same) adds a lot of complexity. We haven't decided that we need that level of complexity for producing CRLs, which have to handle roughly the same number of serials as this, since any serials we load here will end up on CRLs a few minutes later. If a load test has shown that gRPC between the admin tool and the SA is a bottleneck we need to overcome, then great, this is worth it. But if doing 10k-row inserts into the database is still going to be the primary bottleneck, I'd have a preference for keeping this simple.

Comment thread sa/saa.go

var incidentTable string
var buf []string

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding an explicit check that the table exists, so we can provide a better error message if it doesn't? Totally optional, feel free to ignore.

@aarongable aarongable requested review from a team and jsha and removed request for a team April 30, 2026 00:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

admin: Add commands to create and enable incident tables

2 participants